Clustering Large Databases with Numeric and Nominal Values Using Orthogonal Projections

Authors

  • Boriana L. Milenova
  • Marcos M. Campos
Abstract

Clustering large high-dimensional databases has emerged as a challenging research area. A number of recently developed clustering algorithms have focused on overcoming either the “curse of dimensionality” or the scalability problems associated with large amounts of data. The majority of these algorithms operate only on numeric data, a few handle nominal data, and very few can deal with both numeric and nominal values. Orthogonal Partitioning Clustering (O-Cluster) was originally introduced as a fast, scalable solution for large multidimensional databases with numeric values. Here, we extend O-Cluster to domains with nominal and mixed values. O-Cluster uses a top-down partitioning strategy based on orthogonal projections to identify areas of high density in the input data space. The algorithm employs an active sampling mechanism and requires at most a single scan through the data. We demonstrate the high quality of the obtained clustering solutions, their explanatory power, and O-Cluster’s good scalability.
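
The top-down, projection-based partitioning idea can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how split points are chosen or how nominal attributes and active sampling are handled, so the code below covers numeric data only and uses a simple assumed rule (a histogram bin that holds fewer than half as many points as the denser bins on both sides). The names `find_valley_split`, `partition`, `ratio`, and `min_size` are hypothetical.

```python
import numpy as np

def find_valley_split(values, bins=10, ratio=0.5):
    """Return a cut point inside a sparse histogram bin flanked by denser bins, or None.

    Assumed rule (not taken from the paper): a bin is a valley if its count is
    below `ratio` times the smaller of the highest counts to its left and right.
    """
    counts, edges = np.histogram(values, bins=bins)
    best = None
    for i in range(1, bins - 1):
        flank = min(counts[:i].max(), counts[i + 1:].max())
        depth = flank - counts[i]
        if counts[i] < ratio * flank and (best is None or depth > best[0]):
            # Cut at the center of the deepest valley bin found so far.
            best = (depth, 0.5 * (edges[i] + edges[i + 1]))
    return None if best is None else best[1]

def partition(data, min_size=20, bins=10):
    """Recursively split an (n_samples, n_features) array along axis-parallel projections."""
    if len(data) >= min_size:
        for dim in range(data.shape[1]):
            cut = find_valley_split(data[:, dim], bins)
            if cut is not None:
                left = data[data[:, dim] <= cut]
                right = data[data[:, dim] > cut]
                return partition(left, min_size, bins) + partition(right, min_size, bins)
    # No projection shows a valley (or too few points): treat as one cluster.
    return [data]

# Two dense groups separated along the first axis only.
a = np.column_stack([np.linspace(0, 1, 50), np.linspace(0, 1, 50)])
b = np.column_stack([np.linspace(9, 10, 50), np.linspace(0, 1, 50)])
clusters = partition(np.vstack([a, b]), min_size=60)
```

Each leaf of the recursion is an axis-parallel box, which is one reason partitions of this kind are easy to describe as simple per-attribute rules.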

Similar resources

Clustering Mixed Data via Diffusion Maps

Data clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, customer segmentation, trend analysis, pattern recognition and image analysis. Although many clustering algorithms have been proposed, most of them deal with clustering of numerical data. Finding the similarity between numeric objects usually relies on a com...

Unsupervised Learning with Mixed Numeric and Nominal Data

This paper presents a Similarity-Based Agglomerative Clustering (SBAC) algorithm that works well for data with mixed numeric and nominal features. A similarity measure, proposed by Goodall for biological taxonomy [15], that gives greater weight to uncommon feature value matches in similarity computations and makes no assumptions of the underlying distributions of the feature values, is adopted...

Automatic Clustering of Mixed Data Using a Genetic Algorithm

In real-world clustering problems, cluster analysis often has to be performed on data sets with mixed numeric and categorical values. However, most existing clustering algorithms are efficient only for numeric data rather than for mixed data sets. In addition, traditional methods, for example the K-means algorithm, usually ask the user to provide the number of clusters. In this...

Text Document Cluster Analysis Through Visualization of 3D Projections

Clustering has been used as a tool for understanding the content of large text document sets. As the volume of stored data has increased, so has the need for tools to understand output from clustering algorithms. We developed a new visual interface to meet this demand. Our interface helps non-technical users understand documents and clusters in massive databases (e.g., document content, cluster...

Publication date: 2004